Abstract

We consider a multi-armed bandit setting that is inspired by real-world applications in e-commerce. 
In our setting, there are a few types of users, each with a specific response to the different arms. When 
a user enters the system, his type is unknown to the decision maker. The decision maker can either
treat each user separately, ignoring the previously observed users, or attempt to take advantage of
knowing that only a few types exist and cluster the users according to their response to the arms. We
devise algorithms that combine the usual exploration-exploitation tradeoff with clustering of users and
demonstrate the value of clustering. In the process of developing algorithms for the clustered setting, we
propose and analyze simple algorithms for the setup where a decision maker knows that a user belongs
to one of a few types, but does not know which one.

I. Introduction 

Multi-armed bandit (MAB) models are a benchmark model for learning to make decisions under uncertainty.
In the classical stochastic model [?], a decision maker chooses between several alternatives
("arms") that offer uncertain and unknown payoff; through successive experimentation ("exploration") 
on the arms the decision maker learns those alternatives that are most valuable, and proceeds to use 
those in the future ("exploitation"). MAB models are used in a very wide range of application areas 
featuring stochastic decision making, including pricing, marketing, advertising, product selection, and 
recommendation systems. 

In many application areas of interest for MAB models, however, the decision maker faces two simultaneous
challenges. First, the decisions made may be context-specific. For example, in online recommendation
systems (such as the one used by Amazon), the product recommended to a given user (the
"arm") depends on the characteristics of the user herself: her demographics, past purchase history, etc.
In response to this challenge, recent literature has considered a contextual version of the classical MAB
model [?].

The second challenge is that in general, the number of contexts may be quite large, and the number of 
observations per context may be quite small; this is certainly the case in most recommendation settings, 
where each user may only be interacting with the system for a relatively small number of purchases, 
and yet the number of users can be quite high. In these settings the only way for the decision maker to
effectively learn is to exploit latent low-dimensional structure in the high-dimensional problem; that is,
to identify a few features that capture most of the heterogeneity of contexts, or to cluster the contexts 
into a few groups with similar characteristics. 

In practice, these two challenges are addressed in separate phases. Typically, the decision maker 
estimates low-dimensional structure from high-dimensional data offline; i.e., based on previously collected 
data about contexts. After doing this estimation, the inferred low-dimensional representation is used in 
solving the online contextual bandit problem [?]. In other words, exploration and exploitation in real
time is restricted to learning only how a given context fits into the previously inferred low dimensional 
structure; the low-dimensional representation is only refined on much longer timescales, and is effectively 
decoupled from learning. 

In this paper we propose a model that combines both low-dimensional estimation and online learning, 
that we refer to as clustered bandits. The main motivation is that by combining the two, exploration can be 
made more intelligent. In particular, we account for the benefit not only of learning the correct decision 
for a particular context, but also how more information about a context informs the low-dimensional 
representation of the overall context space. 

In our model, we assume that users ("contexts") arrive over time, and the decision maker must choose 
the best decision for each user, from a fixed finite set of alternatives. Though there may be a large number 
of users, we assume that users come from only a fixed set of finitely many types; users of the same type 
give the same average reward on each arm. The decision maker, of course, does not observe the true type 
of each user, and does not know the average reward vector of each type. Thus the decision maker must 
simultaneously cluster the users into groups by type, and also make the best decision for each type. 

Our main contributions are as follows. First, we propose a novel model of multi-armed bandits, incorporating
the combination of low-dimensional estimation and online learning described above. This model
lays the foundation for analysis of combined estimation and bandit problems, which are of critical importance
for practical applications. Second, we propose two distinct approaches to developing algorithms for this
setting: one where we first explore, then estimate clusters, then exploit; and another where we continuously
cluster. We provide novel algorithms in each setting, as well as analysis of regret performance. Third,
we use numerical experiments to demonstrate the value of clustering over time.

II. Problem formulation 

We consider the problem of stochastic multi-user multi-armed bandits, called clustered bandits. The
setting is as follows. There is a finite set of users $U = \{1, 2, \ldots, M\}$ and a finite set of user types
$X = \{1, 2, \ldots, N\}$. We assume that every user belongs to some type, and we are interested in the case
where $M$ is generally much larger than $N$. At each time $t = 1, 2, \ldots, T$, a user $u \in U$ arrives into an
online system, and the system has to choose an action (arm) $a \in A$, where $A = \{1, 2, \ldots, K\}$ is the
set of arms; upon choosing the arm, the system obtains a reward. We assume i.i.d. Bernoulli rewards,
whose expectations depend on both the action taken and the type of that user. In particular, let $x$ be the
current user's type; then the expected reward of action $a$ is denoted by $\theta_x(a)$. For convenience, let $\theta_x$
denote the expected reward vector under type $x$, i.e.,

$$\theta_x := [\theta_x(1), \theta_x(2), \ldots, \theta_x(K)] \in [0,1]^K.$$

Also, define:

$$\theta_x^* := \max_{1 \le a \le K} \theta_x(a), \qquad a_x^* := \arg\max_{1 \le a \le K} \theta_x(a), \qquad \Delta_x(a) := \theta_x^* - \theta_x(a).$$

For simplicity in our paper, we assume that $a_x^*$ is unique for each $x$, though our results extend naturally
even without this assumption. We emphasize that an important aspect of this problem is that for all users
belonging to the same type, say $x$, the expected reward vector across the arms is exactly the same, given
by $\theta_x$. We also call $\theta_x$ a parameter vector, and $\{\theta_x\}_{x \in X}$ the parameter set.

Our goal is to propose a policy (algorithm) to take an action (pull an arm) whenever a user arrives. 
We assume that the system only has access to the following information: number of actions (K), number 
of types (N), and the user's ID when he or she arrives (u). Note that we assume the number of types is 
known; this assumption is made for simplicity, and corresponds to an ex ante determination of the number 
of "clusters" of interest to the decision maker. More generally, an interesting open direction concerns 
online identification of the right number of clusters. 

The performance of a policy is measured as regret with respect to the oracle policy that knows both
the type of each user and the reward vector under each type. In particular, let $u_t$ be the user arriving at
time $t$, $x(u_t)$ be the type of user $u_t$, and $a_t$ be the action taken by the policy at time $t$. Then the expected
regret is defined as:

$$E[\text{Reg}] = \sum_{t=1}^{T} \theta^*_{x(u_t)} - E\left[\sum_{t=1}^{T} \theta_{x(u_t)}(a_t)\right].$$
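
To make the regret definition concrete, the following is a minimal simulation sketch. It assumes 0-indexed types and arms and a hypothetical `policy(user_id, history)` callable; the name `simulate_regret` and its arguments are illustrative and do not appear in the paper. The function accumulates the per-step gap between the oracle's expected reward and that of the chosen arm.

```python
import random

def simulate_regret(theta, user_types, arrivals, policy, seed=0):
    """Run `policy` on a clustered-bandit instance and return its cumulative regret.

    theta[x][a]   : expected reward of arm a for type x (hidden from the policy)
    user_types[u] : type of user u (hidden from the policy)
    arrivals      : sequence of user IDs u_1, ..., u_T
    policy        : callable (user_id, history) -> arm index (hypothetical interface)
    """
    rng = random.Random(seed)
    history = []   # (user_id, arm, reward) triples observed so far
    regret = 0.0
    for u in arrivals:
        x = user_types[u]
        a = policy(u, history)
        reward = 1 if rng.random() < theta[x][a] else 0   # Bernoulli reward
        history.append((u, a, reward))
        regret += max(theta[x]) - theta[x][a]             # theta*_x - theta_x(a_t)
    return regret
```

Averaging the returned value over many seeds would approximate the expected regret of a randomized policy.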






A. Discussion of existing solutions 

Since each user corresponds to a stochastic MAB problem, one naive approach is to treat users
separately, and run a separate stochastic MAB algorithm, e.g., UCB (Upper Confidence Bound) or
SE (Successive Elimination) [?], for each of them. The expected regret in this case is $O(M \ln(T/M))$.
However, this approach does not take into account the important feature that many users may have the
same expected reward vector. In particular, if the number of users $M$ is much larger than the number of
types $N$, treating each user separately is very inefficient because it fails to exploit latent low-dimensional
structure.

As mentioned in the Introduction, this problem is also very similar to the contextual bandit
problem [?], in which the user's ID is treated as the "context." However, the existing
solutions for contextual bandits do not seem to be applicable to this setting. In particular, the solutions in
[?] involve finding the best policy among a limited set of policies mapping from the context (i.e.,
the user's ID in this case) to the action. On the other hand, the works in [?] consider the scenario in which each
context is assumed to be associated with an arm configuration, and hence, their solutions are the same
as the per-user stochastic MAB approach above.
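
As a point of reference, the per-user baseline can be sketched as follows. This is an illustrative sketch only: it assumes the standard UCB1 index with exploration constant 2, and the class name `PerUserUCB` and its methods are hypothetical. Each user keeps private statistics, so nothing learned from one user transfers to another, which is the inefficiency discussed above.

```python
import math
from collections import defaultdict

class PerUserUCB:
    """One independent UCB1 instance per user ID (no sharing across users)."""

    def __init__(self, num_arms):
        self.K = num_arms
        self.counts = defaultdict(lambda: [0] * num_arms)   # pulls per (user, arm)
        self.sums = defaultdict(lambda: [0.0] * num_arms)   # reward sums per (user, arm)

    def select(self, user):
        counts, sums = self.counts[user], self.sums[user]
        for a in range(self.K):          # pull every arm once for this user first
            if counts[a] == 0:
                return a
        t = sum(counts)                  # number of rounds this user has been served
        return max(range(self.K),
                   key=lambda a: sums[a] / counts[a]
                   + math.sqrt(2 * math.log(t) / counts[a]))

    def update(self, user, arm, reward):
        self.counts[user][arm] += 1
        self.sums[user][arm] += reward
```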

B. Our approaches 

Our main idea is to exploit the latent low-dimensional structure: there are only a few types of users, 
and the expected rewards from users belonging to the same type are exactly the same. This suggests that if
we can "group" the users that belong to the same type into the same group, then we can leverage the latent
low-dimensional type information despite the potentially high number of users. To this end, we consider 
two different approaches in this paper, described below. 

Approach 1: Exploration, clustering, exploitation. First, we assume that we know the parameter set
$\{\theta_x\}_{x \in X}$, but not the exact type of each arriving user. We discuss an algorithm that can efficiently solve
this problem for one user, and inspired by this algorithm, demonstrate that the following approach can be
used: first, we explore over an initial set of users; then we cluster from this data; and finally we run that
algorithm for new users. This corresponds closely to actual practice in collaborative filtering systems.

This approach uses as a subroutine an efficient algorithm for the following setting: the system knows
the set of parameter vectors, but does not know which one is the true one. Since the system has "some"
information about the rewards, it is expected that the regret will be better than in the traditional multi-armed
bandit setting. This setting is addressed in more detail in the next section.

Approach 2: Online clustering. The previous approach does not include clustering as an online
component of the learning process: it is executed once and set for the remainder of the decision horizon.
An alternative is to continuously cluster as we learn. In particular, suppose first that we know the type of
each arriving user (but do not know the parameter vector for each type); then the problem is reduced to
$N$ separate bandit problems, one for each type. Thus a reasonable approach including online clustering
proceeds as follows: In each time step, given past observations, we use a clustering technique to divide
users into $N$ groups. Then we treat the problem as if the types of users are known.

III. Detour: (single user) multi-armed bandits with known parameter set 

We make a detour in this section by considering a variation of the standard MAB problem. The variation 
itself is of independent interest, but more importantly, the results (and algorithms) for this setting will be 
used to elaborate the exploration-clustering-exploitation approach described above to tackle the clustered 
bandit problem. 

The detailed setting is as follows. There is a finite set of parameters $X = \{1, 2, \ldots, N\}$, a finite set
of arms $A = \{1, 2, \ldots, K\}$, and a finite set of reward vectors $\{\theta_z\}_{z \in X}$; all are known to the system.
However, the true rewards are driven by an unknown parameter $x \in X$ (i.e., the true reward vector is $\theta_x$,
although the system knows only that $x \in X$). Note that we are not considering multiple users here.

The performance of a policy for this problem is measured as regret with respect to the policy that
knows the true parameter. In particular, if $x$ is the true parameter, and $a_t$ is the action taken at time $t$,
then the expected regret is defined as

$$E[\text{Reg}] = T\theta_x^* - E\left[\sum_{t=1}^{T} \theta_x(a_t)\right].$$

This problem is different from the traditional MAB problem in the sense that the system has "some" 
information about the expected rewards: it is known that the reward vector belongs to a known set. Of 
course, one can ignore that fact and apply a UCB-like algorithm naively (to get an expected regret of
$O(K \ln T)$). A more clever approach can yield the regret $O(\min\{N, K\} \ln T)$ by considering only the
set of optimal arms (each corresponding to a parameter) when $N < K$. But can we do better than that?
In the next subsection, we will present an algorithm that, depending on the structure of the parameter set,
can achieve $O(1)$ regret in some cases.
A. The UCB-KT (Upper Confidence Bound with Known Types) algorithm 

Let $\hat\theta_t$ be the empirical reward vector up to time $t$. Given a parameter $x$, define $B(x)$ as the set of
parameters for which the optimal arm is better than the optimal arm for $x$, while the reward distribution
under the optimal arm for $x$ is the same, i.e.,

$$B(x) := \{z \in X : \theta_z(a_x^*) = \theta_x(a_x^*),\ a_z^* \ne a_x^*\}.$$

Also, let $\mathcal{E}$ denote the set of "elite" arms, i.e., $\mathcal{E} := \{i \in A : i = a_x^* \text{ for some } x \in X\}$. That is, $\mathcal{E}$
includes only arms that are optimal for some parameter $x \in X$. Note that $|\mathcal{E}| \le |X| = N$.
For a value $\theta \in [0, 1]$, let us define the $\epsilon$-neighborhood of $\theta$ as follows:

$$\epsilon\text{-nbd}(\theta) := \{\mu \in [0, 1] : |\mu - \theta| < \epsilon\}.$$

Since $X$ is finite, it is possible to find an $\epsilon^* > 0$ such that all $\epsilon^*$-neighborhoods of $\theta_x(a)$ are disjoint for
distinct values of $\theta_x(a)$, $x \in X$, $a \in A$.

Now, let us define the following conditions:

• C1(x): $\hat\theta_t(a) \in \epsilon^*\text{-nbd}(\theta_x(a))$, $\forall a \in A$, and $B(x)$ is empty;

• C2(x): $\hat\theta_t(a) \in \epsilon^*\text{-nbd}(\theta_x(a))$, $\forall a \in A$, and $B(x)$ is non-empty;

• C3: there does not exist $x \in X$ such that $\hat\theta_t(a) \in \epsilon^*\text{-nbd}(\theta_x(a))$, $\forall a \in A$.

The UCB-KT algorithm is as follows (its detailed pseudo-code is presented in Appendix A). First,
identify the value of $\epsilon^*$ based on the parameter set $\{\theta_x\}_{x \in X}$. Then, for $t = 1, \ldots, K$, pull each arm
once. For each $t \ge K + 1$:

• If C1(x) is satisfied for some $x \in X$, pull $a_x^*$ (the optimal arm under $x$);

• If C2(x) is satisfied for some $x \in X$, perform one step of the UCB algorithm on the set of
"elite" arms $\mathcal{E}$, i.e., pull the arm that achieves

$$\arg\max_{a \in \mathcal{E}} \left\{ \hat\theta_t(a) + \sqrt{\frac{2 \ln t}{T_t(a)}} \right\},$$

where $T_t(a)$ is the number of times that arm $a$ has been pulled up to time $t$;

• If C3 is satisfied, round-robin among all arms.

The main idea is that even if round-robin (C3) moves us to the wrong $x$, if $B(x)$ is empty, then pulling
the optimal arm under $x$ (C1) will eventually move our parameter estimate away from $x$. However, if
the round-robin in (C3) moves us to the wrong $x$, and $B(x)$ is non-empty, then pulling only $a_x^*$ does
not help us at all: it cannot move us away from $x$ when the true parameter is in $B(x)$, because under
$a_x^*$ both the wrong and true parameters yield the same reward. Thus, we need to explore more on $\mathcal{E}$, and
in the proposed algorithm we use UCB to do that.

Theorem 1 (Upper bound for UCB-KT): Let $x$ be the true parameter. Then there exists a constant
$\gamma_A > 0$, which depends on $\epsilon^*$, such that for the UCB-KT algorithm:

• if $B(x)$ is empty, then

$$E[\text{Reg}]^{\text{UCB-KT}}(x, T) \le \gamma_A;$$

• if $B(x)$ is non-empty, then

$$E[\text{Reg}]^{\text{UCB-KT}}(x, T) \le \sum_{a \in \mathcal{E},\, a \ne a_x^*} \frac{c}{\Delta_x(a)} \ln T + \gamma_A,$$

where $c > 0$ is a numerical constant.

The proof is presented in Appendix B. The theorem says that the performance of UCB-KT depends
on the structure of the parameter set and the true parameter. In particular, if $B(x)$ is non-empty for the
true parameter $x$, then the regret of UCB-KT is $O(\min\{N, K\} \ln T)$, which is the same as the UCB-based
approach mentioned above. However, if the set $B(x)$ is empty for the true parameter $x$, then UCB-KT
can achieve $O(1)$ regret.
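
The UCB-KT decision rule described in this subsection can be sketched in code as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: arms and parameters are 0-indexed, `eps_star` picks half of the smallest gap between distinct reward values (one valid choice of $\epsilon^*$, assuming at least two distinct values), the UCB index uses the standard constant 2, and all function names are illustrative.

```python
import math

def best_arm(theta_x):
    """Index of the (assumed unique) optimal arm a*_x."""
    return max(range(len(theta_x)), key=lambda a: theta_x[a])

def elite_arms(thetas):
    """Arms that are optimal for at least one parameter."""
    return sorted({best_arm(th) for th in thetas})

def B(thetas, x):
    """Parameters z that agree with x on a*_x but have a different optimal arm."""
    ax = best_arm(thetas[x])
    return [z for z in range(len(thetas))
            if thetas[z][ax] == thetas[x][ax] and best_arm(thetas[z]) != ax]

def eps_star(thetas):
    """Half the smallest gap between distinct reward values (assumed choice)."""
    values = sorted({v for th in thetas for v in th})
    return min(b - a for a, b in zip(values, values[1:])) / 2

def ucb_kt_step(t, thetas, emp, counts, eps):
    """Arm to pull at time t (1-indexed) given empirical means and pull counts."""
    K = len(emp)
    if t <= K:                                    # initial pass: pull each arm once
        return t - 1
    matches = [x for x in range(len(thetas))
               if all(abs(emp[a] - thetas[x][a]) < eps for a in range(K))]
    if not matches:                               # C3: round-robin among all arms
        return (t - 1) % K
    x = matches[0]
    if not B(thetas, x):                          # C1(x): exploit a*_x
        return best_arm(thetas[x])
    return max(elite_arms(thetas),                # C2(x): UCB restricted to elite arms
               key=lambda a: emp[a] + math.sqrt(2 * math.log(t) / counts[a]))
```

Because the $\epsilon^*$-neighborhoods are disjoint, at most one parameter can match the empirical vector when the parameter vectors are distinct, so taking the first match is harmless in that case.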

One may ask if we can do better than $O(\ln T)$ when $B(x)$ is non-empty for the true parameter $x$. The
answer turns out to be negative. In particular, it has been proved by Agrawal et al. [?] that, if $x$ is the
true parameter and the set $B(x)$ is non-empty, under some mild conditions, the regret of any algorithm
$\phi$ satisfies the following:

$$\liminf_{T \to \infty} \frac{E[\text{Reg}]^{\phi}(x, T)}{\ln T} \ge \min_{\alpha \in P_{-x}} \max_{z \in B(x)} \frac{\sum_{a \in A_{-x}} \alpha(a) \Delta_x(a)}{\sum_{a \in A_{-x}} \alpha(a) I_a(x \| z)}, \qquad (1)$$

where $A_{-x} := A \setminus \{a_x^*\}$ is the set of all arms except the optimal arm under $x$, $P_{-x}$ is the simplex over
$A_{-x}$, and $I_a(x \| z)$ is the Kullback-Leibler divergence between reward distributions of arm $a$ under $x$ and
$z$.

We conclude this subsection by noting that Agrawal et al. [?] also proposed an algorithm for this setting,
and proved that their algorithm can match the lower bound in (1). However, their algorithm is rather
complicated, and more importantly, requires precomputation of the optimal distribution $\alpha^*$ that achieves
the minimization in (1). This requirement turns out to be a crucial point that makes their algorithm
inapplicable for the clustered bandit problem. On the other hand, the proposed UCB-KT algorithm is
simpler, offers a competitive performance, and as we will show later, can be modified to apply to the
clustered bandit setting.

B. Numerical results

In this section, we perform a numerical experiment to verify the performance of UCB-KT. The
experiment includes a parameter set of $N = 21$ parameter vectors, indexed from 0 to 20. The number
of arms is $K = 21$ (also indexed from 0 to 20). For parameter $x = 0$, we have that $\theta_0(0) = 0.55$ and
$\theta_0(a) = 0.5$ for $a = 1, \ldots, 20$. For parameters $x = 1, \ldots, 20$, we have that $\theta_x(0) = 0.55$, $\theta_x(x) = 0.6$,
and $\theta_x(a) = 0.5$ for the remaining arms $a \notin \{0, x\}$. Therefore, $B(x = 0)$ is non-empty, while $B(x)$ is empty for $x \ne 0$.

Fig. 1. UCB vs. UCB-KT vs. Agrawal et al.
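
The parameter set of this experiment, together with the sampling of the true parameter used in the runs, can be reproduced with a short sketch. The helper names (`build_parameter_set`, `B`, `sample_true_parameter`) are illustrative assumptions; the check confirms that $B(0)$ contains all other parameters while $B(x)$ is empty for $x \ne 0$.

```python
import random

def build_parameter_set(n=21):
    """theta_0 = (0.55, 0.5, ..., 0.5); theta_x additionally has 0.6 on arm x."""
    thetas = []
    for x in range(n):
        row = [0.5] * n
        row[0] = 0.55
        if x > 0:
            row[x] = 0.6
        thetas.append(row)
    return thetas

def best_arm(theta_x):
    return max(range(len(theta_x)), key=lambda a: theta_x[a])

def B(thetas, x):
    """Parameters z that agree with x on a*_x but have a different optimal arm."""
    ax = best_arm(thetas[x])
    return [z for z in range(len(thetas))
            if thetas[z][ax] == thetas[x][ax] and best_arm(thetas[z]) != ax]

def sample_true_parameter(rng, n=21):
    """x = 0 with probability 1/2, otherwise uniform over 1, ..., n-1."""
    return 0 if rng.random() < 0.5 else rng.randint(1, n - 1)
```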

We run the experiment for 100 runs. For each run, we pick $x = 0$ as the true parameter with probability
1/2, while picking each of the rest as the true parameter with probability $1/(2 \cdot 20)$. (In this way, $B(x)$
is empty for roughly half of the runs.) Figure 1 shows the average regrets of three algorithms: UCB (on
the set of optimal arms), UCB-KT, and Agrawal et al. (the optimal distribution $\alpha^*$ for this parameter set
can be pre-computed easily, and we omitted the error bars because the variation was small). One can
see that UCB-KT performs better than UCB but not as well as the Agrawal et al. algorithm, which is
expected from the theory.

C. The UCB-KT($\delta$) algorithm

In this last part of the detour, let us present a variation of the UCB-KT algorithm, called UCB-KT($\delta$),
in which we introduce some disturbance to the set $B(x)$. This variation will be useful for the development
of clustered bandit algorithms in the next section.

UCB-KT($\delta$): same as the UCB-KT algorithm, but using the following $B(x, \delta)$ instead of $B(x)$:

$$B(x, \delta) = \{z : \text{for some } a' \text{ s.t. } \theta_x(a') > \sup_a \theta_x(a) - 2\delta, \text{ there holds } |\theta_z(a') - \theta_x(a')| < 2\delta,$$
$$\text{while there is at least one } a \ne a' \text{ such that } \theta_z(a) > \theta_z(a') - 2\delta\}.$$
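
The membership test for $B(x, \delta)$ can be written directly from the definition. The sketch below is illustrative only (the function name `B_delta` and 0-indexing are assumptions); it scans every candidate $z$ and every near-optimal arm $a'$ of $x$.

```python
def B_delta(thetas, x, delta):
    """Return the list of parameters z belonging to B(x, delta)."""
    K = len(thetas[x])
    top = max(thetas[x])
    members = []
    for z in range(len(thetas)):
        for a1 in range(K):
            # a' must be within 2*delta of the best reward under x
            if thetas[x][a1] <= top - 2 * delta:
                continue
            # z must be nearly indistinguishable from x on a'
            if abs(thetas[z][a1] - thetas[x][a1]) >= 2 * delta:
                continue
            # some other arm must be competitive with a' under z
            if any(a != a1 and thetas[z][a] > thetas[z][a1] - 2 * delta
                   for a in range(K)):
                members.append(z)
                break
    return members
```

With the two-type example $\theta_0 = [0.6, 0.5, 0.5, 0.5]$, $\theta_1 = [0.5, 0.6, 0.5, 0.5]$ used later in the paper, a small $\delta$ keeps the set empty, whereas a $\delta$ comparable to the reward gap makes it non-empty.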

Theorem 2 (Upper bound for UCB-KT($\delta$)): Let $x$ be the true parameter. Then there exists a constant
$\gamma_A > 0$, which depends on $\epsilon^*$, such that for the UCB-KT($\delta$) algorithm:

• if $B(x, \delta)$ is empty, then

$$E[\text{Reg}]^{\text{UCB-KT}(\delta)}(x, T) \le \gamma_A;$$

• if $B(x, \delta)$ is non-empty, then

$$E[\text{Reg}]^{\text{UCB-KT}(\delta)}(x, T) \le \sum_{a \in \mathcal{E},\, a \ne a_x^*} \frac{c}{\Delta_x(a)} \ln T + \gamma_A,$$

where $c > 0$ is a numerical constant.

Note that $B(x) \subseteq B(x, \delta)$ for any $\delta > 0$. Thus, if $B(x, \delta)$ is empty, $B(x)$ is also empty. The proof
then follows the same steps as the proof of Theorem 1. We omit the details.

IV. Algorithms for clustered bandits 

Based on the ideas and results developed before, in this section, we propose three algorithms for the 
clustered bandit problem. The first two algorithms are based on the exploration-clustering-exploitation 
approach and the results obtained in Section III, while the last algorithm is based on the online clustering
approach. 

A. Unif - Clustering - UCB-ET($\delta$)

This algorithm has an input parameter $M_0$ and works as follows: For the first $M_0$ users (called pilot
users), sample the arms uniformly at random. After a large enough number of pilot users, perform a
clustering algorithm (with $N$ being the number of clusters) to obtain $N$ estimated parameter vectors
$\{\hat\theta_x\}$. Then, run the UCB-ET($\delta$) algorithm for the new (non-pilot) users.

UCB-ET($\delta$) (UCB with Estimated Types): same as the UCB-KT($\delta$) algorithm, but using the following
$\hat B(x, \delta)$ instead of $B(x, \delta)$. (The only difference between them is that $\hat B(x, \delta)$ is defined on the estimated
parameter vectors $\{\hat\theta_x\}$ instead of the true parameter vectors $\{\theta_x\}$.)

$$\hat B(x, \delta) = \{z : \text{for some } a' \text{ s.t. } \hat\theta_x(a') > \sup_a \hat\theta_x(a) - 2\delta, \text{ there holds } |\hat\theta_z(a') - \hat\theta_x(a')| < 2\delta,$$
$$\text{while there is at least one } a \ne a' \text{ such that } \hat\theta_z(a) > \hat\theta_z(a') - 2\delta\}.$$

The detailed pseudo-code of the algorithm is presented in Appendix C.
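
The pilot phase and the clustering step can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes a plain Lloyd-style k-means written from scratch, gives an arm that a pilot user never pulled an empirical mean of 0, and uses hypothetical helper names throughout. The resulting cluster centers play the role of the estimated parameter vectors $\{\hat\theta_x\}$.

```python
import random

def pilot_empirical_vectors(thetas, pilot_types, tau, seed=0):
    """Uniformly sample arms for each pilot user and return empirical reward vectors."""
    rng = random.Random(seed)
    K = len(thetas[0])
    vectors = []
    for x in pilot_types:                      # x is the (hidden) type of the pilot user
        sums, counts = [0.0] * K, [0] * K
        for _ in range(tau):
            a = rng.randrange(K)               # uniform exploration
            sums[a] += 1.0 if rng.random() < thetas[x][a] else 0.0
            counts[a] += 1
        vectors.append([sums[a] / counts[a] if counts[a] else 0.0 for a in range(K)])
    return vectors

def sq_dist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def kmeans(points, n_clusters, iters=50, seed=0):
    """Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, n_clusters)]

    def assign():
        return [min(range(n_clusters), key=lambda c: sq_dist(p, centers[c]))
                for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(n_clusters):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                        # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign()
```

With few pulls per arm the empirical vectors are noisy, so the quality of the estimated centers depends on $M_0$ and on how long each pilot user stays.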

B. UCB - Clustering - UCB-ET($\delta$)

This algorithm is the same as the previous algorithm except for one point: for the first $M_0$ users,
we sample arms according to the UCB policy rather than uniformly. The detailed pseudo-code of the
algorithm is presented in Appendix D.

This algorithm and the previous algorithm illustrate why the algorithm by Agrawal et al. is not 
applicable for this approach to the clustered bandit problem: running the algorithm by Agrawal et al. 
requires the computation of the optimal distribution $\alpha^*$ (that achieves the minimization in (1)) on the
estimated parameter vectors, which cannot be precomputed (since the estimated parameter vectors are ex 
ante random). 

C. Continuous clustering & UCB 

This algorithm is based on the online clustering approach (Section II-B): at every time slot, perform
a clustering algorithm (with $N$ being the number of clusters) on all users' empirical reward vectors to
divide them into $N$ groups. We then run $N$ separate UCB policies, one per group; i.e., at the current
time step, we take an action according to the UCB policy specialized to the group containing the
current user. The detailed pseudo-code of the algorithm is presented in Appendix E.
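
The per-step decision of this scheme, once cluster labels are available, can be sketched as below. The sketch assumes that labels come from some clustering routine run on the current empirical reward vectors, that statistics are pooled over all users sharing the current user's label, and that the UCB index uses the standard constant 2 with the pooled pull count as the time index; the function name `pooled_ucb_arm` and its arguments are hypothetical.

```python
import math

def pooled_ucb_arm(user, labels, sums, counts):
    """Pick an arm for `user` by running UCB on data pooled over the user's cluster.

    labels[u] : cluster index of user u (from any clustering routine)
    sums[u]   : per-arm reward sums for user u
    counts[u] : per-arm pull counts for user u
    """
    K = len(counts[user])
    group = [u for u in labels if labels[u] == labels[user]]
    pooled_n = [sum(counts[u][a] for u in group) for a in range(K)]
    pooled_s = [sum(sums[u][a] for u in group) for a in range(K)]
    for a in range(K):                 # any arm unexplored within the cluster goes first
        if pooled_n[a] == 0:
            return a
    total = sum(pooled_n)
    return max(range(K),
               key=lambda a: pooled_s[a] / pooled_n[a]
               + math.sqrt(2 * math.log(total) / pooled_n[a]))
```

A newly arrived user thus benefits immediately from the pulls of earlier users placed in the same cluster, which is the mechanism behind the gains over the per-user baseline.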

V. Numerical results 

In this section, we perform a numerical experiment to evaluate the performance of the proposed
algorithms for clustered bandits. The clustering method being used is k-means. The experiment includes
a parameter set of $N = 2$ parameter vectors, each with $K = 4$ arms: $\theta_0 = [0.6, 0.5, 0.5, 0.5]$ and
$\theta_1 = [0.5, 0.6, 0.5, 0.5]$. (That is, both $B(0)$ and $B(1)$ are empty.) In total, there are 2000 users arriving
over time, and each of them stays for exactly $\tau = 100$ time slots. Each arriving user is of type 0 with
probability 1/2 and of type 1 with probability 1/2.

Figure 2 shows the average regrets of various algorithms: UCB-per-user, k-means & UCB continuously,
Unif - k-means - UCB-ET($\delta$) with $M_0 = 40$ and $\delta = 0.01$, UCB - k-means - UCB-ET($\delta$)
with $M_0 = 40$ and $\delta = 0.01$, Agrawal et al., and UCB-on-types (again, we omitted the error bars
because the variation was small). Note that the Agrawal et al. algorithm requires the parameter set to
be known in advance, and the UCB-on-types algorithm requires the type of each user to be known in
advance. Thus, the Agrawal et al. algorithm serves as the lower bound on regret for Unif - k-means - UCB-ET($\delta$) and
UCB - k-means - UCB-ET($\delta$), while UCB-on-types serves as the lower bound on regret for k-means &
UCB continuously. We can see that all the proposed algorithms perform better than the UCB-per-user
algorithm, and in particular, k-means & UCB continuously works extremely well.

Fig. 2. Regrets of various algorithms for clustered bandits

The preceding insight is an important result of our paper. In particular, this demonstrates the value of 
online clustering in such settings: the decision maker can do significantly better through more frequent 
re-estimation of the latent low-dimensional structure in the system. 

VI. Theoretical analysis 

In this section, we outline an approach to regret bounds for algorithms that jointly cluster and learn. Our 
approach consists of two steps. First, we assume that the clustering algorithm we use can be characterized 
by an error probability that depends on the number of samples observed and the desired confidence. Next, 
we demonstrate how this performance can be coupled to our earlier theoretical analysis of UCB-KT($\delta$)
to obtain regret bounds.

For this section, for concreteness, we consider the Unif-Clustering-UCB-ET($\delta$) scheme discussed in
Section IV-A. For purposes of a theoretical model, we assume that users arrive over time, stay for exactly
$\tau$ time slots, and then leave. Further, we assume that the type of each user is sampled uniformly at random from the
parameter set $X$.

Recall that Unif-Clustering-UCB-ET($\delta$) works by using a clustering scheme after the first $M_0$ users
are sampled. We start with the following assumption on this clustering scheme.

Assumption 1 (Hypothetical clustering scheme): Fix $\delta > 0$ and $M_0$. Suppose that for the first $M_0$
users, we sample arms uniformly at random at every time step from the arm set $A$; we then cluster the
users into $N$ clusters based on their empirical average reward vectors. Let $\hat\theta_x(a)$ be the empirical average
reward on arm $a$ of users assigned to cluster $x$. We assume the clustering algorithm is such that:

$$P\left(\min_{\sigma} \max_{x, a} |\hat\theta_{\sigma(x)}(a) - \theta_x(a)| > \delta\right) \le g(\delta, M_0),$$

for some function $g$ that is decreasing in $\delta$ and decreasing in $M_0$. (Here $\sigma$ ranges over all permutations
of $X$.)

The previous assumption says that (up to a relabeling of the empirical reward vectors for each cluster), 
our clustering scheme is able to identify the correct clusters with high probability. Now observe that on 
the complement of the event identified in the preceding assumption, we will have learned the correct 
cluster centers to within confidence $\delta$. Thus on this event we expect that UCB-ET($\delta$) will perform well.
On the event in the preceding assumption, however, our regret can be as bad as linear. Further, as we
lower $\delta$, we make $B(x, \delta)$ smaller, but at the expense of higher potential error in clustering. Thus the
tradeoff is between the high regret when we fail to cluster effectively, and the high regret when we cluster
to within a $\delta$ bound but $B(x, \delta)$ is too large to give low regret.

With this in mind, consider the Unif-Clustering-UCB-ET($\delta$) scheme in which we do uniform sampling
for the first $M_0$ users, then cluster by the hypothetical clustering algorithm in Assumption 1, and then run
UCB-ET($\delta$) for the remaining $M - M_0$ users. Then the regret of this scheme is upper bounded as
follows.

Theorem 3: Suppose that users arrive sequentially over time, with types sampled uniformly at random from the
parameter set $X$, and stay for $\tau$ periods each before leaving. Then the expected regret of the above
Unif-Clustering-UCB-ET($\delta$) scheme is upper bounded as follows:

$$E[\text{Reg}] \le M_0 \tau \sum_{x \in X} \frac{1}{N} \sum_{a \in A} \frac{\Delta_x(a)}{K} + g(\delta, M_0)(T - M_0 \tau)\left(\max_{x, a} \Delta_x(a)\right)$$
$$\qquad + (1 - g(\delta, M_0))\left(\frac{T}{\tau} - M_0\right) \sum_{x \in X} \frac{1}{N} E[\text{Reg}]^{\text{UCB-KT}(2\delta)}(x, \tau),$$

where UCB-KT($2\delta$) is as defined in Section III-C.

The proof of the theorem involves showing that the regret of UCB-ET($\delta$) is bounded above by the
regret of UCB-KT($2\delta$). The result follows if on the complement of the event in Assumption 1 we have
$\hat B(x, \delta) \subseteq B(x, 2\delta)$; and this follows by a straightforward calculation from the definitions. We omit the
details.

The preceding theorem formalizes the tradeoff discussed above. In particular, in optimizing the regret
bound, two parameters are considered. First, by increasing $M_0$, we obtain better confidence in our
clustering (and thus a smaller $g(\delta, M_0)$), at the expense of high regret in the initial phase. Second, by
increasing $\delta$, we also obtain a smaller $g(\delta, M_0)$, but at the expense of potentially larger sets $B(x, 2\delta)$,
and thus higher regret in the final phase. For a given clustering scheme (and thus a given $g$), minimizing
over $M_0$ and $\delta$ would yield the best regret bound of this type.

Appendix



A. The UCB-KT algorithm

Algorithm 1 UCB-KT

Input: $t$ (current time), $K$ (number of arms), $\{\theta_z\}_{z \in X}$ (set of reward vectors), $\epsilon^*$ (parameter)
Output: $I_t$ (index of the arm to be pulled)

if $t \le K$ then
    $I_t \leftarrow t$ (pull each arm once)
else
    if C1(x) is satisfied for some $x \in X$ then
        $I_t \leftarrow a_x^*$ (the optimal arm under $x$)
    else if C2(x) is satisfied for some $x \in X$ then
        $I_t \leftarrow \arg\max_{a \in \mathcal{E}} \hat\theta_t(a) + \sqrt{\frac{2 \ln t}{T_t(a)}}$ (UCB on the set of "elite" arms $\mathcal{E}$)
    else
        $I_t \leftarrow ((t - 1) \bmod K) + 1$ (round-robin among all arms)
    end if
end if



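B. Theorem 1 and its proof

First, we prove the following lemma.

Lemma 1: Let $X_1, X_2, \ldots, X_n$ be i.i.d. Bernoulli random variables with mean $\mu$. Let

$$\hat\mu_n := \frac{1}{n} \sum_{i=1}^{n} X_i, \quad \text{and} \quad L_\epsilon := \sup\{n \ge 1 : |\hat\mu_n - \mu| > \epsilon\},$$

for some $\epsilon > 0$. Then,

$$E[L_\epsilon] \le \gamma(\epsilon), \quad \text{where } \gamma(\epsilon) := \frac{2e^{-2\epsilon^2}}{(1 - e^{-2\epsilon^2})^2}.$$

Proof: By the Chernoff-Hoeffding bound, we have that:

$$P(|\hat\mu_n - \mu| > \epsilon) \le 2e^{-2\epsilon^2 n}.$$

Note that

$$E[L_\epsilon] = E\left[\sum_{n=1}^{\infty} 1(\exists i \ge n : |\hat\mu_i - \mu| > \epsilon)\right]$$
$$= E\left[\sum_{n=1}^{\infty} 1\left(\bigcup_{i \ge n} \{|\hat\mu_i - \mu| > \epsilon\}\right)\right]$$
$$\le \sum_{n=1}^{\infty} \sum_{i=n}^{\infty} 2e^{-2\epsilon^2 i}$$
$$= 2 \sum_{n=1}^{\infty} \frac{e^{-2\epsilon^2 n}}{1 - e^{-2\epsilon^2}}$$
$$= \frac{2e^{-2\epsilon^2}}{(1 - e^{-2\epsilon^2})^2}.$$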

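Now, recall that $x$ is the true parameter. For $a \ne a_x^*$, we have that:

$$T_n(a) = \sum_{t=1}^{n} 1(I_t = a) = 1 + \sum_{t=K+1}^{n} 1(I_t = a)$$
$$= 1 + \sum_{t=K+1}^{n} 1(I_t = a,\ \text{C1}(z) \text{ is satisfied at time } t \text{ for some } z \in X)$$
$$\quad + \sum_{t=K+1}^{n} 1(I_t = a,\ \text{C2}(z) \text{ is satisfied at time } t \text{ for some } z \in X)$$
$$\quad + \sum_{t=K+1}^{n} 1(I_t = a,\ \text{C3 is satisfied at time } t)$$
$$= 1 + \text{Term 1} + \text{Term 2} + \text{Term 3}. \qquad (2)$$

Let us define $\mathcal{L}_a := \sup\{T_n(a) \ge 1 : \hat\theta_n(a) \notin \epsilon^*\text{-nbd}(\theta_x(a))\}$. Then by Lemma 1, $E[\mathcal{L}_a] \le \gamma(\epsilon^*)$.
Moreover, we have that Term 1 $\le \mathcal{L}_a$ (since $B(z)$ is empty in this case), and Term 3 $\le \sum_{a' \in A} \mathcal{L}_{a'}$.
Therefore,

$$E[\text{Term 1}] \le \gamma(\epsilon^*), \qquad (3)$$

and

$$E[\text{Term 3}] \le K\gamma(\epsilon^*). \qquad (4)$$

Now, we note that Term 2 can be expanded as:

$$\text{Term 2} = \sum_{t=K+1}^{n} 1(I_t = a,\ \text{C2}(z) \text{ is satisfied at time } t \text{ for some } z \ne x)$$
$$\quad + \sum_{t=K+1}^{n} 1(I_t = a,\ \text{C2}(x) \text{ is satisfied at time } t)$$
$$= \text{Term 2a} + \text{Term 2b}. \qquad (5)$$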

Let us consider the case when $B(x)$ is empty. Then we have that:

$$E[\text{Term 2a}] = E\left[\sum_{t=K+1}^{n} 1(I_t = a,\ \text{C2}(z) \text{ is satisfied at time } t \text{ for some } z \ne x)\right]$$
$$\le E\left[\sum_{t=K+1}^{n} 1(I_t = a_x^*,\ \text{C2}(z) \text{ is satisfied at time } t \text{ for some } z \ne x)\right] 1(a \in \mathcal{E})$$
$$\le \gamma(\epsilon^*)\, 1(a \in \mathcal{E}) \quad (\text{if } B(x) \text{ is empty}). \qquad (6)$$

The first inequality of the above expression is due to the following fact: under C2(z), the UCB algorithm
is run over $\mathcal{E}$, which also includes $a_x^*$; and since $x$ is the true parameter, the expected time spent on $a_x^*$
is larger than the expected time spent on any other arm. The second inequality is due to the following
fact: the condition C2(z) is satisfied for some $z \ne x$, which means that $B(z)$ is non-empty; and since
$B(x)$ is empty, it must be that $\theta_x(a_x^*) \ne \theta_z(a_x^*)$.

Now, if $B(x)$ is non-empty, then we have that:

$$E[\text{Term 2a}] = E\left[\sum_{t=K+1}^{n} 1(I_t = a,\ \text{C2}(z) \text{ is satisfied at time } t \text{ for some } z \ne x)\right] \le \ldots \quad (\text{if } B(x) \text{ is non-empty}). \qquad (7)$$

The above inequality is due to the UCB result in [?].

Finally, we have that:

$$\ldots \qquad (8)$$

Combining (2)-(8) yields the result.

C. The Unif - Clustering - UCB-ET($\delta$) algorithm

Algorithm 2 Unif - Clustering - UCB-ET($\delta$)

Input: $M_0$ (number of pilot users), $N$ (number of types), $K$ (number of arms), $\delta$ (parameter)
Output: $I_1, I_2, \ldots$ (indices of the arms to be pulled)

Initialize: $\hat\theta_i \leftarrow 0$ for $i = 1, \ldots, N$ (estimated parameter vector for each type), $P \leftarrow \emptyset$ (pilot set)
Initialize: $\hat\mu(u) \leftarrow 0$ for $u \in U$ (empirical reward vector for each user)
$t = 1$ (time starts)
while $t \ge 1$ do
    Obtain the user's ID $u_t$
    if $u_t$ is a new user then
        if $|P| < M_0$ (number of pilot users is less than $M_0$) then
            $P \leftarrow P \cup \{u_t\}$ (add $u_t$ to the pilot set)
            $I_t \leftarrow$ a uniformly random chosen arm
        else
            $I_t \leftarrow$ UCB-ET($\delta$) with the estimated parameter vectors $\{\hat\theta_i\}$
        end if
    else
        if $u_t \in P$ ($u_t$ is a pilot user) then
            $I_t \leftarrow$ a uniformly random chosen arm
        else
            $I_t \leftarrow$ UCB-ET($\delta$) with the estimated parameter vectors $\{\hat\theta_i\}$
        end if
    end if
    Obtain the reward, update the empirical reward vector $\hat\mu(u_t)$
    if $|P| \ge M_0$ and $u_t \in P$ then
        Do clustering on $\{\hat\mu(u)\}_{u \in P}$ to obtain the estimated parameter vectors $\{\hat\theta_i\}_{i=1,\ldots,N}$
    end if
    $t \leftarrow t + 1$
end while

D. The UCB - Clustering - UCB-ET($\delta$) algorithm

Algorithm 3 UCB - Clustering - UCB-ET($\delta$)

Input: $M_0$ (number of pilot users), $N$ (number of types), $K$ (number of arms), $\delta$ (parameter)
Output: $I_1, I_2, \ldots$ (indices of the arms to be pulled)

Initialize: $\hat\theta_i \leftarrow 0$ for $i = 1, \ldots, N$ (estimated parameter vector for each type), $P \leftarrow \emptyset$ (pilot set)
Initialize: $\hat\mu(u) \leftarrow 0$ for $u \in U$ (empirical reward vector for each user)
$t = 1$ (time starts)
while $t \ge 1$ do
    Obtain the user's ID $u_t$
    if $u_t$ is a new user then
        if $|P| < M_0$ (number of pilot users is less than $M_0$) then
            $P \leftarrow P \cup \{u_t\}$ (add $u_t$ to the pilot set)
            $I_t \leftarrow$ UCB for user $u_t$ only
        else
            $I_t \leftarrow$ UCB-ET($\delta$) with the estimated parameter vectors $\{\hat\theta_i\}$
        end if
    else
        if $u_t \in P$ ($u_t$ is a pilot user) then
            $I_t \leftarrow$ UCB for user $u_t$ only
        else
            $I_t \leftarrow$ UCB-ET($\delta$) with the estimated parameter vectors $\{\hat\theta_i\}$
        end if
    end if
    Obtain the reward, update the empirical reward vector $\hat\mu(u_t)$
    if $|P| \ge M_0$ and $u_t \in P$ then
        Do clustering on $\{\hat\mu(u)\}_{u \in P}$ to obtain the estimated parameter vectors $\{\hat\theta_i\}_{i=1,\ldots,N}$
    end if
    $t \leftarrow t + 1$
end while

E. The Clustering & UCB Continuously algorithm

Algorithm 4 Clustering & UCB Continuously

Input: $N$ (number of types), $K$ (number of arms), $M_{th}$ (parameter)
Output: $I_1, I_2, \ldots$ (indices of the arms to be pulled)

Initialize: $\hat\mu(u) \leftarrow 0$ for $u \in U$ (empirical reward vector for each user)
Initialize: $\mathcal{U} \leftarrow \emptyset$ (set of current users)
$t = 1$ (time starts)
while $t \ge 1$ do
    Obtain the user's ID $u_t$
    if $u_t$ is a new user then
        $\mathcal{U} \leftarrow \mathcal{U} \cup \{u_t\}$ (add new user)
        if $|\mathcal{U}| < M_{th}$ (number of current users is less than $M_{th}$) then
            $I_t \leftarrow$ UCB for user $u_t$ only
        else
            Do clustering on $\{\hat\mu(u)\}_{u \in \mathcal{U}}$ to obtain $N$ clusters
            $I_t \leftarrow$ UCB for the cluster that $u_t$ belongs to (using data from all users in that cluster)
        end if
    else
        if $|\mathcal{U}| < M_{th}$ (number of current users is less than $M_{th}$) then
            $I_t \leftarrow$ UCB for user $u_t$ only
        else
            Do clustering on $\{\hat\mu(u)\}_{u \in \mathcal{U}}$ to obtain $N$ clusters
            $I_t \leftarrow$ UCB for the cluster that $u_t$ belongs to (using data from all users in that cluster)
        end if
    end if
    $t \leftarrow t + 1$
end while



